evaluation practice
Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Chehbouni, Khaoula, Haddou, Mohammed, Cheung, Jackie Chi Kit, Farnadi, Golnoosh
Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible practices in LLJ evaluation, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.
- Europe > Austria > Vienna (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.05)
- (18 more...)
- Health & Medicine (0.46)
- Law (0.46)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Prandi, Matteo, Suriani, Vincenzo, Pierucci, Federico, Galisai, Marcello, Nardi, Daniele, Bisconti, Piercosma
The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap." We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities. On average, benchmarks devote 61.6% of their regulatory-relevant questions to "Tendency to hallucinate" and 31.2% to "Lack of performance reliability", while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This study provides the first comprehensive, quantitative analysis of this gap, demonstrating that current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance and offering critical insights for the development of next-generation evaluation tools.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > Canada > Newfoundland and Labrador > Labrador (0.04)
- Europe > Monaco (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.87)
- Government (1.00)
- Law > Statutes (0.87)
- Information Technology > Security & Privacy (0.66)
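As a rough illustration of the coverage-mapping step the Bench-2-CoP abstract describes, the sketch below labels each benchmark question with an LLM judge and tallies coverage per taxonomy category. The taxonomy labels, prompt wording, and the call_llm_judge stub are assumptions for illustration only, not the authors' pipeline or the full EU AI Act taxonomy.

```python
from collections import Counter

# Illustrative labels; the EU AI Act / CoP taxonomy used in the paper is far richer.
TAXONOMY = [
    "Tendency to hallucinate",
    "Lack of performance reliability",
    "Evading human oversight",
    "Self-replication",
    "Autonomous AI development",
    "Not regulatory-relevant",
]

JUDGE_PROMPT = (
    "You are auditing an AI benchmark for regulatory coverage.\n"
    "Assign the question below to exactly one category from this list:\n"
    + "\n".join(f"- {label}" for label in TAXONOMY)
    + "\n\nQuestion: {question}\nAnswer with the category name only."
)

def call_llm_judge(prompt: str) -> str:
    """Placeholder: wire this up to whichever LLM client is available."""
    raise NotImplementedError

def map_benchmark_to_taxonomy(questions: list[str]) -> Counter:
    """Label every benchmark question and tally coverage per category."""
    coverage = Counter()
    for question in questions:
        label = call_llm_judge(JUDGE_PROMPT.format(question=question)).strip()
        # Bucket any judge output that falls outside the taxonomy.
        coverage[label if label in TAXONOMY else "Unparsed judge output"] += 1
    return coverage

def coverage_report(coverage: Counter) -> None:
    """Print the share of regulatory-relevant questions per category."""
    total = sum(coverage.values())
    for label, count in coverage.most_common():
        print(f"{label}: {count} questions ({100 * count / total:.1f}%)")
```

The report step corresponds to the kind of per-category percentages the abstract cites (e.g., the share of questions devoted to hallucination versus loss-of-control capabilities).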
SPHERE: An Evaluation Card for Human-AI Systems
Ma, Qianou, Zhao, Dora, Zhao, Xinran, Si, Chenglei, Yang, Chenyang, Louie, Ryan, Reiter, Ehud, Yang, Diyi, Wu, Tongshuang
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion of design options for human-AI system evaluation, we present SPHERE, an evaluation card that encompasses five key dimensions: 1) What is being evaluated?; 2) How is the evaluation conducted?; 3) Who is participating in the evaluation?; 4) When is the evaluation conducted?; 5) How is the evaluation validated? We review 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement, and provide three recommendations for improving the validity and rigor of evaluation practices.
- North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
- Asia > Singapore (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (17 more...)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (0.94)
- Health & Medicine (1.00)
- Education (1.00)
- Media (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
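The five SPHERE dimensions lend themselves to a structured record. The minimal sketch below assumes field names paraphrased from the abstract's five questions and an invented example entry; it is not the card's published schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SphereCard:
    """One evaluation-card entry, with a field per SPHERE dimension."""
    what_is_evaluated: str   # 1) What is being evaluated?
    how_conducted: str       # 2) How is the evaluation conducted?
    who_participates: str    # 3) Who is participating in the evaluation?
    when_conducted: str      # 4) When is the evaluation conducted?
    how_validated: str       # 5) How is the evaluation validated?

# Hypothetical example entry, for illustration only.
card = SphereCard(
    what_is_evaluated="Task success and user trust in a writing assistant",
    how_conducted="Controlled lab study with Likert-scale questionnaires",
    who_participates="24 recruited crowdworkers",
    when_conducted="Summative evaluation after the final prototype",
    how_validated="Inter-rater agreement plus a pilot study",
)
print(json.dumps(asdict(card), indent=2))
```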
Scenario-based Evaluation of Prediction Models for Automated Vehicles
Sánchez, Manuel Muñoz, Elfring, Jos, Silvas, Emilia, van de Molengraft, René
To operate safely, an automated vehicle (AV) must anticipate how the environment around it will evolve. For that purpose, it is important to know which prediction models are most appropriate for every situation. Currently, prediction models are often assessed over a set of trajectories without distinguishing the type of movement they capture, making it impossible to determine each model's suitability for different situations. In this work we illustrate how such standardized evaluation methods can lead to incorrect conclusions about a model's predictive capabilities, preventing a clear assessment of prediction models and potentially leading to dangerous on-road situations. We argue that, following established practices in AV safety assessment, prediction models should be evaluated in a scenario-based fashion. To encourage scenario-based assessment of prediction models and illustrate the dangers of improper assessment, we categorize trajectories of the Waymo Open Motion dataset according to the type of movement they capture. We then thoroughly evaluate three different models across trajectory types and prediction horizons. Results show that common evaluation methods are insufficient and that assessment should depend on the application in which the model will operate.
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
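A minimal sketch of the scenario-based evaluation this abstract argues for: group trajectories by movement type and report displacement error per type and prediction horizon, rather than one aggregate score. The movement-type labels, horizons, and ADE metric below are generic assumptions, not the paper's exact protocol for the Waymo Open Motion dataset.

```python
from collections import defaultdict
import numpy as np

def average_displacement_error(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth positions."""
    return float(np.linalg.norm(pred - truth, axis=-1).mean())

def evaluate_by_scenario(samples, horizons=(10, 30, 50)):
    """samples: iterable of (movement_type, predicted_xy[T, 2], ground_truth_xy[T, 2]).

    Returns ADE broken down by movement type and horizon (in time steps),
    instead of a single number averaged over all trajectories.
    """
    errors = defaultdict(list)
    for movement_type, pred, truth in samples:
        for horizon in horizons:
            h = min(horizon, len(truth))
            errors[(movement_type, horizon)].append(
                average_displacement_error(pred[:h], truth[:h])
            )
    return {key: float(np.mean(vals)) for key, vals in errors.items()}

# Synthetic straight vs. turning trajectories, purely to show the output shape.
rng = np.random.default_rng(0)
samples = [
    ("straight", rng.normal(size=(50, 2)), rng.normal(size=(50, 2))),
    ("turn_left", rng.normal(size=(50, 2)), rng.normal(size=(50, 2))),
]
for (movement_type, horizon), ade in sorted(evaluate_by_scenario(samples).items()):
    print(f"{movement_type:>10s} @ {horizon:2d} steps: ADE = {ade:.2f}")
```

Reporting the per-scenario breakdown makes it visible when a model that looks strong in aggregate fails on a specific movement type or at longer horizons.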
Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
Hutiri, Wiebke Toussaint, Gorce, Lauriane, Ding, Aaron Yi
Speaker verification (SV) provides billions of voice-enabled devices with access control and ensures the security of voice-driven technologies. As a type of biometrics, SV must be unbiased, with consistent and reliable performance across speakers irrespective of their demographic, social, and economic attributes. Current SV evaluation practices are insufficient for evaluating bias: they are over-simplified, aggregate users, are not representative of real-life usage scenarios, and do not account for the consequences of errors. This paper proposes design guidelines for constructing SV evaluation datasets that address these shortcomings. We propose a schema for grading the difficulty of utterance pairs and present an algorithm for generating inclusive SV datasets. We empirically validate our proposed method in a set of experiments on the VoxCeleb1 dataset. Our results confirm that the number of utterance pairs per speaker and the difficulty grading of utterance pairs have a significant effect on evaluation performance and variability. Our work contributes to the development of SV evaluation practices that are inclusive and fair.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada (0.05)
- Asia > India (0.05)
- (9 more...)
- Law (0.46)
- Information Technology > Security & Privacy (0.34)
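A hedged sketch of difficulty-graded trial generation in the spirit of this abstract: pairs are graded by speaker identity and shared attributes, and an equal number of trials is sampled per speaker. The attribute names, grading rules, and pairs_per_speaker parameter are illustrative assumptions, not the paper's schema or algorithm.

```python
import itertools
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Utterance:
    speaker_id: str
    gender: str       # illustrative attributes; the paper's schema grades
    nationality: str  # difficulty along more dimensions than these two
    path: str

def pair_difficulty(a: Utterance, b: Utterance) -> str:
    """Grade a trial pair: different-speaker pairs that share attributes are
    harder impostor trials than pairs that differ on every attribute."""
    if a.speaker_id == b.speaker_id:
        return "target"
    if a.gender == b.gender and a.nationality == b.nationality:
        return "hard_impostor"
    if a.gender == b.gender:
        return "medium_impostor"
    return "easy_impostor"

def build_trials(utterances, pairs_per_speaker=20, seed=0):
    """Sample the same number of graded trial pairs for every speaker,
    so that no speaker dominates the aggregate error rates."""
    rng = random.Random(seed)
    by_speaker = {}
    for u in utterances:
        by_speaker.setdefault(u.speaker_id, []).append(u)
    trials = []
    for speaker, own in by_speaker.items():
        others = [u for u in utterances if u.speaker_id != speaker]
        candidates = list(itertools.product(own, others)) + [
            (a, b) for a, b in itertools.combinations(own, 2)
        ]
        for a, b in rng.sample(candidates, min(pairs_per_speaker, len(candidates))):
            trials.append((a.path, b.path, pair_difficulty(a, b)))
    return trials
```

Evaluating error rates separately per difficulty grade, rather than over the pooled trial list, is what exposes the bias that aggregate scores hide.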
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Gehrmann, Sebastian, Clark, Elizabeth, Sellam, Thibault
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations, and with commonly used datasets in NLG, that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences to assess how well they already follow these suggestions and identify which areas require more drastic changes to the status quo.